In [23]:
%matplotlib inline

from scipy.stats import (binom, norm, poisson, hypergeom,
                         rv_discrete)

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

Administrative Stuff

Connect to the Jupyter server that I have created on Amazon EC2. URL is:

https://54.172.168.134:8888/

Password is: "LifeIsGood!"

Once you're in, create a new notebook, then click on the "Untitled" at the top and rename the notebook to your own name.

Learning Outcomes:

By the end of this section, you will be able to:

  1. Use Python functions to visualize the Normal, Binomial, Hypergeometric, and Poisson distributions.
  2. Describe the meaning of a p-value.
  3. Describe the difference between the population and sample distribution.
  4. Describe the Central Limit Theorem and its use in inferential statistics.

Inferential Statistics

In the first part, we learned about descriptive statistics, and common plots used to visualize the data.

In this section, we will learn about inferential statistics, which, boiled down to its core, is all about figuring out how probable our data (or something more extreme) would be under random chance.

The Logic of Inference

We observe some data. We want to know the probability that we could have observed that data, or something more extreme, purely by random chance.

To do so, we need a few ingredients:

  1. A test statistic - a summary statistic of the data, usually the mean or variance.
  2. A null random model - essentially one of the distributions taught by Sean, or a non-parametric alternative.

Inspired by @jakevdp (Jake Vanderplas, UW eScience Institute), we will not be using analytical methods (i.e. algebra) in this segment; instead, we will use computational methods.
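
Concretely, every example below follows the same computational recipe, which we can sketch as a small helper function (empirical_p_value is a name made up here for illustration, not a library function):

In [ ]:
import numpy as np

def empirical_p_value(null_samples, observed):
    """Fraction of simulated null draws at least as extreme as the observed statistic."""
    return np.mean(np.asarray(null_samples) >= observed)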

Example 1: The probability of getting that result or something more extreme

Here is a data point: we have flipped a coin 20 times, and it came up heads 16 times. Is this coin biased or not?

  1. Firstly, let's define the test statistic as "the number of heads flipped".
  2. Secondly, let's define the null random model as "20 coin flips where the probability of heads is 0.5".

In [87]:
# Simulate 10,000 experiments of 20 fair coin flips; record the number of heads in each.
null_flips = binom.rvs(n=20, p=0.5, size=10000)
plt.hist(null_flips)
plt.axvline(16)  # the observed result


Out[87]:
<matplotlib.lines.Line2D at 0x11d0a9e10>


Our test tells us how often we would see this many heads, or more, under randomness. Let us count up the number of simulated experiments in which we see 16 or more heads under the null model.


In [3]:
num_heads = sum(null_flips >= 16)
num_heads


Out[3]:
79

Finally, let's calculate the probability of seeing this number by random chance.


In [4]:
p_value = num_heads / len(null_flips) 
p_value


Out[4]:
0.0079000000000000008

The probability of seeing this result under a null model of p=0.5 is very low (well under 1%). Therefore, it is likely that the coin is biased.

Exercise 1

Epidemiologists have been longitudinally measuring the rate of cancer incidence in a population of humans. Over the first 10 years, the cancer rate was 10 per 10,000 by age 30. In the 11th and 12th years, the rate went up to 14 per 10,000. Calculate the p-value for the 11th year and for the 12th year separately.

Hint: Model this as a Poisson process.

What are you assuming about the null model?


In [94]:
# Simulate 100,000 years of incidence under the historical rate of 10 per 10,000.
null_poisson = poisson.rvs(10, size=100000)
sum(null_poisson >= 14) / 100000


Out[94]:
0.13552
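
As a sanity check on the simulated answer, scipy can compute the exact tail probability directly via the Poisson survival function:

In [ ]:
# Exact P(X >= 14) under Poisson(10): sf(13) = P(X > 13).
poisson.sf(13, 10)  # ~0.1355, matching the simulation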

Visualize this using the plot below.


In [6]:
plt.hist(null_poisson)
plt.axvline(14)


Out[6]:
<matplotlib.lines.Line2D at 0x1041896a0>

Exercise 2

In your average bag of M&Ms, out of every 100 candies there are on average 30 brown, 20 yellow, 20 red, 10 orange, 10 green, and 10 blue. I draw 50 candies and find that 22 of them are brown. Is this bag an "average" bag?


In [32]:
# Null model: draw N=50 candies without replacement from a bag of M=100,
# of which n=30 are brown; count the brown candies drawn.
null_hypergeom = hypergeom.rvs(M=100, n=30, N=50, size=10000)
sum(null_hypergeom >= 22) / 10000


Out[32]:
0.0012999999999999999
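
Again, we can cross-check the simulation against the exact hypergeometric tail probability:

In [ ]:
# Exact P(X >= 22) brown when drawing N=50 from M=100 candies with n=30 brown.
hypergeom.sf(21, 100, 30, 50)  # ~0.001, matching the simulation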

Once again, plot this below.


In [33]:
plt.hist(null_hypergeom)
plt.axvline(22)


Out[33]:
<matplotlib.lines.Line2D at 0x118aa7a90>

Exercise 3

Systolic blood pressure in healthy patients is Normally distributed with mean 122 mmHg and standard deviation 9 mmHg. In hypertensive patients, it is Normally distributed with mean 140 mmHg and standard deviation 6 mmHg. (I pulled these numbers out of a hat - http://www.cdc.gov/nchs/data/nhsr/nhsr035.pdf)

A patient comes in and measures 130 mmHg.

What is the p-value of this patient's blood pressure, under:

  • the hypothesis that this patient is healthy?
  • the hypothesis that this patient is hypertensive?

Plot the distributions and the patient's systolic blood pressure.


In [9]:
# Simulate blood pressures under each hypothesis.
null_healthy = norm.rvs(122, 9, size=100000)
null_hyper = norm.rvs(140, 6, size=100000)

sns.kdeplot(null_healthy)
sns.kdeplot(null_hyper)
plt.axvline(130)  # the patient's measurement


Out[9]:
<matplotlib.lines.Line2D at 0x1172ffeb8>

In [10]:
# P-value calculation under healthy hypothesis
np.sum(null_healthy >= 130) / len(null_healthy)


Out[10]:
0.18745999999999999

In [11]:
# P-value calculation under hypertension hypothesis
np.sum(null_hyper <= 130) / len(null_hyper)


Out[11]:
0.04734
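
Both p-values can also be cross-checked analytically with the Normal CDF:

In [ ]:
# Exact tail probabilities under each hypothesis.
print(norm.sf(130, loc=122, scale=9))   # P(X >= 130 | healthy), ~0.19
print(norm.cdf(130, loc=140, scale=6))  # P(X <= 130 | hypertensive), ~0.048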

Example 2: Difference between two experimental groups

You've measured enzyme activity under two inhibitors and a control without an inhibitor. Is the difference a true difference, or could we have observed this purely by random chance? Take a look at the data below.


In [12]:
# inhibitor1 = norm.rvs(70, 22, size=20)
# inhibitor2 = norm.rvs(72, 10, size=30)
# control = norm.rvs(75, 20, size=15)
# data = pd.DataFrame([inhibitor1, inhibitor2, control]).T
# data.columns = ['Inhibitor 1', 'Inhibitor 2', 'Control']
# data.to_csv('drug_inhibitor.csv')
data = pd.read_csv('drug_inhibitor.csv', index_col=0)
data.mean().plot(kind='bar', yerr=data.std())
data.head(10)  # displays only the first 10 rows; remove ".head(10)" to see all of the data


Out[12]:
Inhibitor 1 Inhibitor 2 Control
0 47.614170 83.833372 53.489627
1 55.935495 59.976393 56.806193
2 52.248938 67.662847 36.920265
3 84.150209 65.104755 88.682270
4 78.836532 54.887494 86.140043
5 64.303041 70.019334 111.121237
6 69.953035 74.417726 90.671457
7 76.609832 53.116692 109.814540
8 38.909090 77.475146 93.140039
9 67.071579 74.679800 95.721973

Okay, so here's the important, potentially billion-dollar market question: Is inhibitor 1 any good?

Basically, the question we're asking is this: is there a difference between the enzyme's activity under inhibitor 1 vs. the control?

Here are a few problems we face here:

  1. We don't know what distribution the inhibitor and control data came from. You might be tempted to assume a Normal distribution, N(mu, sd), around the mean of each group, but you'd have to justify why this is a rational choice.
  2. The error bars come from the sampled data, not from the so-called "population" data.

Is there a way around this?

Permutation

We can perform a permutation test. Here's the setup.

  • we want to know whether two groups are statistically different or not, BUT
  • we do not have a rationally-defensible underlying distribution with which to model the data explicitly.

Therefore...

  • assume that the treatment labels do not matter
  • compute what the distribution of the difference of means would look like if the treatment labels did not matter.
  • then, calculate the probability of seeing a difference at least as large as the observed one under this permutation distribution.

In [29]:
# Permutation test: pool the two groups, shuffle, and re-split, which is
# equivalent to randomly re-assigning the treatment labels.
inhib1 = data['Inhibitor 1'].dropna().values
control = data['Control'].dropna().values
pooled = np.concatenate([inhib1, control])

differences = []
for i in range(1000):
    shuffled = np.random.permutation(pooled)
    # Split the shuffled pool back into groups of the original sizes.
    fake_control = shuffled[:len(control)]
    fake_inhib1 = shuffled[len(control):]
    differences.append(fake_control.mean() - fake_inhib1.mean())
differences = np.array(differences)
plt.hist(differences, label='Null')
plt.axvline(data.mean()['Control'] - data.mean()['Inhibitor 1'], color='red', label='Data')
plt.legend()


Out[29]:
<matplotlib.legend.Legend at 0x118980f60>

In [14]:
data_diff = data.mean()['Control'] - data.mean()['Inhibitor 1']
np.sum(differences >= data_diff) / len(differences)


Out[14]:
0.16

The probability of seeing a difference this large or larger, assuming that the inhibitor treatment has no effect, is about 3 in 20. We might not be so convinced that this is a good drug.

Notes

Once again, we have chosen a relevant test statistic (the difference of means), computed its distribution under a "null" model, and compared the statistic observed in the data to that null distribution.

The Sampling Distribution

In this section, we will learn:

  • What model fitting is all about.
  • The difference between the "underlying" parameter, which belongs to the population, and the "sample" (or "measured") parameter.

What is model fitting all about? It's basically about finding the parameter set that best describes your data.

Let's start by taking a random sample of data from a Normal(10, 0.5) distribution.


In [15]:
sample = norm.rvs(10, 0.5, size=6)  # we will draw 6 instances from the normal distribution.
sns.kdeplot(sample, color='blue')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x117c24128>
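
As an aside, scipy can compute the best-fitting (maximum-likelihood) Normal parameters for a sample directly with norm.fit:

In [ ]:
# Maximum-likelihood estimates of the mean and standard deviation of the sample.
mu_hat, sigma_hat = norm.fit(sample)
mu_hat, sigma_hat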

In [27]:
population = norm.rvs(10, 0.5, size=10000)
sns.kdeplot(sample, color='blue', label='sample')
sns.kdeplot(population, color='red', label='population')
plt.legend()


Out[27]:
<matplotlib.legend.Legend at 0x118297fd0>

Create a new distribution called fitted, and tweak its parameters so that it fits the sample distribution.
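
Here is one possible scaffold; the parameter values below (9.5 and 0.7) are arbitrary starting guesses for you to tweak:

In [ ]:
# Tweak the (mean, std) guesses until the green curve overlays the blue one.
fitted = norm.rvs(9.5, 0.7, size=10000)
sns.kdeplot(sample, color='blue', label='sample')
sns.kdeplot(fitted, color='green', label='fitted')
plt.legend()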

Discussion

  • What are your parameter values?
  • What's happening here? Isn't the sample drawn from the same distribution as the population?
  • How could the sample parameters and population parameters be different?

The key point here is that the samples are random samples of the population. The mean computed from a sample will not necessarily match the actual population mean.
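
A quick way to see this: draw several size-6 samples from the same population and watch the sample mean bounce around.

In [ ]:
# Each draw of 6 values gives a different sample mean; none is exactly 10.
[norm.rvs(10, 0.5, size=6).mean() for _ in range(5)]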

Central Limit Theorem

The central limit theorem is central to statistics.

Quoting from Wikipedia:

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.

We're going to experimentally look at what this statement means.

A crazy die

Let's say that we have a crazy die with an underlying distribution that we'd never expect to encounter.


In [17]:
xs = range(1, 7)  # faces 1 through 6
ps = [0.4, 0.1, 0, 0.2, 0.3, 0]  # probability of each face
crazy_dice = rv_discrete(name='crazy_dice', values=[xs, ps])

Instructor demo: Let's visualize this distribution.


In [18]:
fig = plt.figure()
ax = fig.add_subplot(111)
# Stem plot of the probability mass function.
ax.plot(xs, crazy_dice.pmf(xs), 'ro', ms=12)
ax.vlines(xs, 0, crazy_dice.pmf(xs), color='red', lw=4)


Out[18]:
<matplotlib.collections.LineCollection at 0x117fb2588>

With this die, we are essentially saying that we expect it to come up 1 (40% of the time), 5 (30% of the time), 4 (20% of the time), and 2 (10% of the time), with no expectation of ever rolling a 3 or a 6.

Now, let's roll the die 50 times. What's the mean?


In [78]:
# Class activity
sample1 = crazy_dice.rvs(size=50)
sns.kdeplot(sample1)
np.mean(sample1)


Out[78]:
2.8599999999999999

Now roll the die just 4 times. What's the mean?


In [20]:
# Class activity
sample2 = crazy_dice.rvs(size=4)
np.mean(sample2)


Out[20]:
2.75

Roll the die 4 times, but now repeat that process 1000 times. What does the distribution of the means look like?


In [84]:
# Class activity
means = []

for i in range(1000):
    mean = np.mean(crazy_dice.rvs(size=4))
    means.append(mean)
    
# plt.hist(np.array(means), color='green')
sns.kdeplot(np.array(means), color='blue')  # density plot of the 1,000 sample means
sns.kdeplot(norm.rvs(2.9, 0.6, size=100000), color='red')  # a Normal curve for comparison (parameters eyeballed)


Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cd8b128>


Try calculating the expected value of the die by hand. The definition of the expected value is:

$$E[X] = \sum_{i=1}^{n} x_i p_i$$

Or in plain English: sum up the product of each value and its probability.

Did you get 2.9?
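
You can also check it numerically with the xs and ps defined earlier:

In [ ]:
# 1(0.4) + 2(0.1) + 3(0) + 4(0.2) + 5(0.3) + 6(0) = 2.9
sum(x * p for x, p in zip(xs, ps))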



Conclusions

What is statistics all about?

  1. Descriptive statistics: Describing and summarizing data. The questions you are asking here are:
    1. How do I show what the data look like?
    2. How do I condense all the data points into a summarized view?
  2. Inferential statistics:
    1. Figuring out whether what you've observed (your data) could have come from random chance or not.
    2. Figuring out what model of randomness your data could have come from.

Also, the Central Limit Theorem explains much about the world around us: averages of many independent random quantities tend to be Normally distributed, which is why the Normal distribution appears so often.

Note on parametric vs. non-parametric tests: use a parametric test when you have a known mechanistic process that justifies a distributional assumption. When the model correctly describes the underlying process, parametric tests give more accurate results than non-parametric ones.
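
For example, the coin-flip question from Example 1 has an exact parametric answer, which lines up with our simulated estimates:

In [ ]:
# Exact P(X >= 16 heads) in 20 fair flips: sf(15) = P(X > 15).
binom.sf(15, n=20, p=0.5)  # ~0.0059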

Resources

  1. Think Stats: http://greenteapress.com/thinkstats/
    1. Allen Downey's other books are also great resources.
  2. Wikipedia articles: Go get buried in the web that is Wikipedia. :)
    1. The p-value: https://en.wikipedia.org/wiki/P-value
    2. Statistical inference: https://en.wikipedia.org/wiki/Statistical_inference
    3. Statistical hypothesis testing: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
  3. Khan Academy: https://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem
